CT-based radiomics for predicting breast cancer radiotherapy side effects

Skin inflammation, with the potential sequel of moist epitheliolysis, and edema constitute the most frequent acute side effects of breast radiotherapy (RT). The aim of this study was to compare the predictive value of tissue-derived radiomics features with that of the total breast volume (TBV) for moist epitheliolysis, as a surrogate for skin inflammation, and for edema. Radiomics features were extracted from computed tomography (CT) scans of 252 breast cancer patients from two volumes of interest: TBV and glandular tissue (GT). Machine learning classifiers were trained on radiomics and clinical features and evaluated for both side effects. The best radiomics model was a least absolute shrinkage and selection operator (LASSO) classifier using TBV features to predict moist epitheliolysis, achieving an area under the receiver operating characteristic curve (AUROC) of 0.74. This was comparable to a model based on breast volume alone (AUROC of 0.75). Combined models of radiomics and clinical features did not improve performance. Exclusion of volume-correlated features slightly reduced the predictive performance (AUROC of 0.71). We demonstrated the general capability of planning CT-based radiomics models to predict breast RT-dependent side effects. Total breast tissue was more predictive than glandular tissue alone. The performance of the radiomics features was influenced by their high correlation with breast volume.

Several similar studies have investigated the use of ML and various types of imaging data to predict RT side effects in breast cancer patients. Research utilizing dosiomics features extracted from CT images managed to accurately predict acute skin toxicity 20. Another study using electron density and biologically effective dose radiomics effectively predicted late radiation-induced subcutaneous fibrosis 21. Additionally, a comprehensive review of ML models analyzed RT-induced complications across multiple cancer sites, including breast cancer 22. Collectively, these studies emphasize the growing interest in using ML and imaging data to mitigate RT side effects.
The objective of this study was to develop a statistically reliable assessment of the predictive capability of radiomics features to predict the most prevalent RT side effects of moist epitheliolysis as a surrogate for skin inflammation and edema based on the total breast volume (TBV) and glandular tissue (GT).

Clinical data collection and curation
The dataset consisted of 252 breast cancer patients who underwent radiotherapy between 2012 and 2016 in the Rechts der Isar university hospital of the Technical University of Munich (TUM). For the patient data acquired at TUM, retrospective analysis of patient records and data is generally allowed following Article 27 of the Bavarian Hospital Act (Bayerisches Krankenhausgesetz) from the Landeskrankenhausgesetz des Freistaates Bayern. Informed consent for treatment was obtained from every patient. Institutional Review Board (IRB) approval was acquired from the review board of TUM (reference number 466/16 s). Clinical variables were defined based on a literature review of known clinical predictors from previous publications. Moreover, variables were selected based on broad availability of data, a constraint that hindered the assessment of other predictive factors 23,24. The selected variables were: smoker status, chemotherapy received, radiotherapy boost, the maximum prescribed radiation dose expressed as equivalent dose in 2 Gy fractions (EQD2, α/β = 3), TBV, and the two targets of prediction: (i) moist epitheliolysis as a surrogate for Common Terminology Criteria for Adverse Events (CTCAE) grade 2 skin inflammation 25 (33 positive cases; referred to henceforth simply as moist epitheliolysis); and (ii) presence of any edema (26 positive cases).
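As an illustration of the EQD2 conversion mentioned above, the following minimal sketch assumes the standard linear-quadratic formula, EQD2 = D × (d + α/β) / (2 + α/β), with an α/β of 3 taken to be in Gy. The function name and the two example schedules are hypothetical and are not taken from the study data.

```python
def eqd2(total_dose_gy: float, dose_per_fraction_gy: float,
         alpha_beta_gy: float = 3.0) -> float:
    """Equivalent dose in 2 Gy fractions under the linear-quadratic model.

    Assumption: the standard formula D * (d + a/b) / (2 + a/b) is used;
    the source only states that doses were expressed as EQD2 with a/b = 3.
    """
    if total_dose_gy < 0 or dose_per_fraction_gy <= 0 or alpha_beta_gy <= 0:
        raise ValueError("doses and alpha/beta must be positive")
    return total_dose_gy * (dose_per_fraction_gy + alpha_beta_gy) / (2.0 + alpha_beta_gy)


# Hypothetical schedules, for illustration only.
conventional = eqd2(50.0, 2.0)        # 2 Gy fractions: EQD2 equals the physical dose
larger_fractions = eqd2(40.05, 2.67)  # fractions above 2 Gy raise EQD2 above the physical dose

print(f"50 Gy in 2 Gy fractions       -> EQD2 = {conventional:.2f} Gy")
print(f"40.05 Gy in 2.67 Gy fractions -> EQD2 = {larger_fractions:.2f} Gy")
```

With 2 Gy fractions the conversion leaves the dose unchanged, which is why a normofractionated cohort shows little EQD2 variability beyond differences in total dose and boost.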

Radiomics data collection and curation
Prior to RT treatment, planning CT images of the breast were acquired. Figure S1 shows the acquisition parameters for these CT images. Exclusion criteria encompassed breast implant and mastectomy cases. Two separate volumes of interest (VOI) were segmented, creating two radiomics cohorts: TBV, containing radiomics information from the whole breast tissue; and glandular tissue (GT), which contained radiomics information only from this tissue. Patient outcome assessment was performed retrospectively by a medical student after thorough teaching by a radiation oncologist (JCP). All methodology has been conducted in accordance with the relevant guidelines and regulations.
Segmentation of the volumes of interest was manually performed by NW, using 3D Slicer 26. GT was defined using the fast growcut function. BSpline interpolation was used to perform isotropic resampling to obtain a voxel size of 1 × 1 × 1 mm. Image discretization was carried out with a fixed bin width of 10. Laplacian of Gaussian filtering was used for image reconstruction (sigma values of 1.0, 2.0, 3.0, 4.0 and 5.0).
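To make the fixed-bin-width discretization concrete, the sketch below maps intensities to bin indices with a bin width of 10. It assumes a binning rule of the form floor(x / W) − floor(min(x) / W) + 1, which is our reading of how fixed bin widths are commonly applied in radiomics toolkits; the exact edge handling used in the study is an assumption, and the intensity values are invented.

```python
import math


def discretize_fixed_bin_width(intensities, bin_width=10.0):
    """Map raw intensities to 1-based bin indices using a fixed bin width.

    Assumption: bin edges are aligned to multiples of bin_width and the
    lowest occupied bin is numbered 1.
    """
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    offset = math.floor(min(intensities) / bin_width)
    return [math.floor(v / bin_width) - offset + 1 for v in intensities]


# Invented Hounsfield-unit-like values inside a hypothetical VOI.
hu_values = [-105.0, -98.0, -51.0, 0.0, 12.0, 37.0]
bins = discretize_fixed_bin_width(hu_values, bin_width=10.0)
print(bins)
```

A fixed bin width keeps the intensity resolution constant across patients, so the number of gray levels varies with the intensity range of each VOI.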
Radiomics features were extracted and filtered from the CT images and both segmentations using the Python library PyRadiomics 27 (version 3.0.1; Python version 3.8.10). A total of 104 features were obtained for each of the radiomics cohorts, which included first-order, shape, and texture features (the latter is composed of "gray-level co-occurrence matrix", "gray-level size-zone matrix", "gray-level run-length matrix", "neighboring gray-tone difference matrix", and "gray-level dependence matrix" features). Figure 1 shows a diagram of the clinical and radiomics features and side effects collection process from the patients. Further, Fig. S2 shows the distribution of patients across all clinical features and side effects measured.

Feature pre-processing and hyperparameter optimization
Repeated nested cross-validation was employed to train and validate the models. The radiomics features were normalized using min-max normalization, which preserves the shape of the original distribution while mapping the values to the [0, 1] range.
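The following minimal sketch illustrates min-max normalization fitted on a training split only and then applied to held-out data, which is the leakage-free pattern implied by applying normalization inside the cross-validation folds. The helper names and the feature values are hypothetical.

```python
def fit_min_max(train_column):
    """Learn the scaling bounds from the training split only."""
    return min(train_column), max(train_column)


def apply_min_max(column, lower, upper):
    """Scale values with previously learned bounds."""
    span = upper - lower
    if span == 0:
        # Constant feature in the training split: map everything to 0.
        return [0.0 for _ in column]
    return [(value - lower) / span for value in column]


# Invented values of one radiomics feature.
train_values = [120.0, 80.0, 200.0, 160.0]
test_values = [100.0, 230.0]

lower, upper = fit_min_max(train_values)
train_scaled = apply_min_max(train_values, lower, upper)
test_scaled = apply_min_max(test_values, lower, upper)

print(train_scaled)
print(test_scaled)
```

Held-out values can fall outside [0, 1] when they exceed the range seen in training; this is expected and preferable to re-fitting the bounds on validation or test data.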
For each cohort, the most informative features were selected and evaluated in two different ways. The first was a double Spearman rank correlation test: first within each dataset, with a cut-off value of 0.9, to remove redundant features; and then against each side effect prediction target, to keep the most relevant features. The second was feature selection using minimum redundancy-maximum relevance (MRMR; version 1.0.2), which incorporates both tests in a single step 28. In both cases, an estimation of the information density and, therefore, of the number of features to select, was made using Principal Component Analysis (PCA). For the TBV radiomics feature set, an average of 23 and 39 features were selected when using MRMR and a double Spearman rank correlation test, respectively. For the GT radiomics feature set, on the other hand, an average of 26 and 44 features were selected when using each of the feature selection techniques, respectively.
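The double Spearman rank correlation test described above can be sketched as follows. This is an illustrative pure-Python version under stated assumptions: when two features exceed the redundancy cut-off, the one encountered first is kept, and relevance is ranked by the absolute correlation with the target. The study does not specify these tie-breaking rules, and the feature names and values are invented.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, assigning tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


def double_spearman_selection(features, target, redundancy_cutoff=0.9, n_keep=2):
    # Step 1 (redundancy): drop a feature if it is too strongly correlated
    # with any feature that has already been kept.
    kept = []
    for name, column in features.items():
        if all(abs(spearman(column, features[other])) <= redundancy_cutoff
               for other in kept):
            kept.append(name)
    # Step 2 (relevance): rank the survivors by |rho| with the target.
    kept.sort(key=lambda name: abs(spearman(features[name], target)), reverse=True)
    return kept[:n_keep]


# Invented toy data: "surface" is a monotone function of "volume".
toy_features = {
    "volume": [1, 2, 3, 4, 5, 6],
    "surface": [2, 4, 5, 8, 10, 13],
    "texture": [5, 3, 6, 2, 4, 1],
    "noise": [2, 5, 1, 6, 3, 4],
}
toy_target = [0, 0, 0, 1, 1, 1]

selected = double_spearman_selection(toy_features, toy_target)
print(selected)
```

In this toy case "surface" is removed in the redundancy step because it carries the same rank information as "volume", and the relevance step then keeps the two remaining features most strongly associated with the target.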
Before finding the optimal hyperparameter values, the class imbalance of the different side effect prediction targets was corrected depending on the level of disproportion. Moist epitheliolysis and edema had a ratio of 6.64:1 and 8.69:1 of negative to positive class sizes, respectively, and were therefore corrected using a combination of synthetic minority over-sampling technique (SMOTE; imbalanced-learn library version 0.11.0) 29 to a ratio of 2:1, and random under-sampling of the majority class to a ratio of 1.25:1. The ratios for each step were chosen to balance the risk of excessive oversampling against the loss of too many samples during undersampling. Balanced accuracy (BA) was the metric used as the optimization criterion for the hyperparameter values, as it can handle the small remaining class imbalance. Hyperparameter optimization was conducted using an exhaustive grid search, where all combinations of hyperparameter values are tested in the validation set of the innermost fold until the optimal values are found.
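The effect of the two-step resampling on class sizes can be worked through with simple arithmetic. The sketch below uses the whole-cohort counts (252 patients, 33 and 26 positive cases) purely for illustration; in the study the correction was applied inside the training folds, so the actual counts per fold are smaller. The function name and the rounding-down rule are assumptions.

```python
def resampling_plan(n_negative, n_positive,
                    ratio_after_oversampling=2.0,
                    ratio_after_undersampling=1.25):
    """Class sizes after oversampling positives, then undersampling negatives.

    Ratios are expressed as negative:positive. Assumption: target sizes are
    rounded down to whole samples.
    """
    positives_after = max(n_positive, int(n_negative / ratio_after_oversampling))
    negatives_after = min(n_negative, int(positives_after * ratio_after_undersampling))
    return {
        "initial_ratio": round(n_negative / n_positive, 2),
        "synthetic_positives": positives_after - n_positive,
        "positives_after_oversampling": positives_after,
        "negatives_removed": n_negative - negatives_after,
        "negatives_after_undersampling": negatives_after,
        "final_ratio": round(negatives_after / positives_after, 2),
    }


cohort_size = 252
epitheliolysis_plan = resampling_plan(cohort_size - 33, 33)
edema_plan = resampling_plan(cohort_size - 26, 26)

print(epitheliolysis_plan)
print(edema_plan)
```

In a resampling library that parameterizes the target as minority over majority, the 2:1 and 1.25:1 negative-to-positive ratios would correspond to values of 0.5 and 0.8, respectively.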

Machine learning modeling
Four ML algorithms were implemented and evaluated: logistic regression (LR), used for its simplicity and efficiency in binary classification tasks with a low feature set dimensionality 30,31; least absolute shrinkage and selection operator (LASSO), a variant with an optimizable regularization term that can potentially better handle imbalanced datasets 32; support vector machine (SVM), a highly flexible algorithm that, thanks to its multiple kernels, can explore non-linear relationships in the data 33; and random forest classifier (RF), an ensemble learning, decision tree-based method that is more robust to overfitting effects 34. All models were imported from the Python library scikit-learn (version 1.0.2) 35. These models were contrasted against clinical model baselines.
After comparing the four model types for each of the radiomics cohorts and feature selection types, the best models were retrained and optimized adding clinical data in order to assess whether a combined model yields a better performance in predicting the presence of any side effect. The workflow followed by the ML pipeline is shown in Fig. 2. In addition, larger reference images of the respective VOIs can be seen in Fig. S1.
Feature selection has been analyzed for all relevant models, estimating a score based on the feature importance assigned by the models and how often each feature was selected. The resulting score is calculated as Score = Feature Importance / [(n + 1) − m], where n is the number of models and m is the number of times the feature has been chosen.
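As a worked example of the score defined above, the sketch below evaluates Score = Feature Importance / [(n + 1) − m] for a feature that is always, sometimes, or never selected. The function name and the importance value are hypothetical.

```python
def selection_weighted_score(importance, n_models, times_chosen):
    """Score = importance / ((n + 1) - m), following the formula in the text."""
    if not 0 <= times_chosen <= n_models:
        raise ValueError("times_chosen must lie between 0 and n_models")
    return importance / ((n_models + 1) - times_chosen)


n_models = 250      # number of trained models
importance = 0.5    # invented feature importance value

always = selection_weighted_score(importance, n_models, 250)
half = selection_weighted_score(importance, n_models, 125)
never = selection_weighted_score(importance, n_models, 0)

print(f"chosen in all models:  {always:.4f}")
print(f"chosen in half:        {half:.4f}")
print(f"never chosen:          {never:.4f}")
```

The "+ 1" in the denominator keeps the score finite when a feature is selected in every model, in which case the score equals the importance itself.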
Finally, the correlation between the breast volume and the prediction probability of the best model has been analyzed to study the overall impact of the breast volume on the predictive value of radiomics features. An additional model was evaluated in which radiomics features that were highly correlated with the breast volume (Spearman correlation higher than 0.8) were excluded, using the best performing configuration. The objective was to assess the impact of volume-correlated features on the performance of radiomics models.

Statistical analysis
Training and validation of the different models were performed using 50 repetitions of nested cross-validation (5 outer folds, 4 inner folds). This resampling technique provides additional statistical robustness, resulting in 250 final models that were aggregated to the final test results.
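The structure of the repeated nested cross-validation can be sketched with index bookkeeping alone. The version below is a simplified illustration: folds are formed by plain random shuffling, whereas a real pipeline on imbalanced data would typically stratify by class, and the function names and seed are arbitrary. It shows how 50 repetitions of 5 outer folds yield 250 train/test evaluations, each with 4 inner splits for tuning.

```python
import random


def k_fold_indices(indices, k, rng):
    """Shuffle the indices and deal them into k roughly equal folds."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def repeated_nested_cv(n_samples, n_repeats=50, outer_k=5, inner_k=4, seed=0):
    """Yield (repeat, outer_fold, train_idx, test_idx, inner_splits)."""
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        outer_folds = k_fold_indices(range(n_samples), outer_k, rng)
        for o, test_idx in enumerate(outer_folds):
            # Everything outside the held-out outer fold is available for training.
            train_idx = [i for f, fold in enumerate(outer_folds) if f != o for i in fold]
            # Inner folds are built from the outer training data only, so the
            # test fold never influences preprocessing or hyperparameters.
            inner_folds = k_fold_indices(train_idx, inner_k, rng)
            inner_splits = []
            for v, val_idx in enumerate(inner_folds):
                fit_idx = [i for f, fold in enumerate(inner_folds) if f != v for i in fold]
                inner_splits.append((fit_idx, val_idx))
            yield repeat, o, train_idx, test_idx, inner_splits


splits = list(repeated_nested_cv(n_samples=252))
print(len(splits))  # 50 repeats x 5 outer folds = 250
```

Each yielded item corresponds to one final model: hyperparameters are chosen on the inner splits, the model is refit on the outer training indices, and it is scored once on the outer test indices.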
In order to gather more information from the radiomics features, PCA was employed as an estimation of the information density within this dataset. The variance retained by the PCA components was used to understand the intrinsic dimensionality of our dataset. However, since the components generated by PCA are combinations of the original features and are generally more compact, these components should not be used as a replacement for feature selection, but only as an estimation. The reason is the inherent added difficulty of tracing the feature importance back to the original features.
In the inner fold of the nested cross-validation, normalization, feature selection, and class imbalance correction were applied in order to avoid data leakage from any training split to the validation (inner fold) or test (outer fold) splits.
One of the two feature selection techniques mentioned in this study is the use of a double Spearman rank correlation test. This approach is intended to optimize feature selection by addressing redundancy and relevance in two distinct steps. First, redundancy is removed so that features that do not provide additional information are eliminated. Second, the Spearman rank correlation test is applied again, this time between the dataset and the prediction target, selecting the features that are most relevant to that target.

The performance of the aggregated models was measured using a combination of metrics: BA, F1, precision, recall, specificity, area under the receiver operating characteristic curve (AUROC) and Matthews correlation coefficient (MCC). Metrics are given with 1.96 standard errors for a confidence interval of 95%. ROC curves were also used to evaluate the trade-off between sensitivity and specificity across different decision thresholds, and to assess the discrimination power between classes of each of the models.
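The threshold-based metrics and the 95% interval can be illustrated with a short sketch. It assumes that the standard error is the sample standard deviation across model scores divided by the square root of their number; the source states only that 1.96 standard errors are reported. The confusion-matrix counts and AUROC values are invented.

```python
import math
import statistics


def _safe_div(numerator, denominator):
    return numerator / denominator if denominator else 0.0


def confusion_metrics(tp, fp, tn, fn):
    """Threshold-based metrics from the counts of one test fold."""
    recall = _safe_div(tp, tp + fn)          # sensitivity
    specificity = _safe_div(tn, tn + fp)
    precision = _safe_div(tp, tp + fp)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "balanced_accuracy": (recall + specificity) / 2.0,
        "f1": _safe_div(2.0 * precision * recall, precision + recall),
        "mcc": _safe_div(tp * tn - fp * fn, mcc_denominator),
    }


def mean_with_ci95(values):
    """Mean and half-width of the interval given by 1.96 standard errors."""
    standard_error = statistics.stdev(values) / math.sqrt(len(values))
    return statistics.mean(values), 1.96 * standard_error


# Invented counts for one imbalanced test fold.
fold_metrics = confusion_metrics(tp=5, fp=10, tn=34, fn=2)
print({name: round(value, 3) for name, value in fold_metrics.items()})

# Invented AUROC values from three models, aggregated as mean +/- 1.96 SE.
mean_auroc, half_width = mean_with_ci95([0.70, 0.74, 0.78])
print(f"AUROC = {mean_auroc:.2f} +/- {half_width:.2f}")
```

On imbalanced folds such as this one, precision and F1 can be low even when balanced accuracy is moderate, which is one reason to report several metrics side by side.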

Results
We evaluated the possibility of predicting side effects of RT in breast cancer (moist epitheliolysis as a surrogate for skin inflammation, and edema) based on the total breast volume (TBV), glandular tissue (GT) and using clinical features. Table 1 summarizes the results that are shown throughout this section. The feature importance was calculated for the best performing radiomics and clinical models (Table 2 and Table S8, respectively).

Side effect prediction
The ROC performance of the best trained models to predict both side effects can be seen in Fig. 3. More scores regarding the comparison of side effects as the prediction target can be seen in Table S2. In addition, the calibration curve of the best performing radiomics model is shown in Fig. S5.
While the edema models performed only slightly above random (best AUROC value of 0.55), both radiomics feature sets showed notable predictive value for moist epitheliolysis: an AUROC of 0.74 for a LASSO classifier trained on the TBV radiomics feature set, and an AUROC of 0.65 for an RF trained on the GT radiomics feature set, with features selected using MRMR in both cases. Overall, models trained to predict moist epitheliolysis performed better than those predicting edema, regardless of the feature selection technique, ML algorithm, or radiomics feature set used for training.
The ROC performance of both radiomics cohorts, with the clinical features as baseline, is shown in Fig. 4. The clinical model achieved an AUROC of 0.7. More scores regarding the comparison of the predictive power of each radiomics feature set and the clinical baseline can be seen in Table S3. Only the best performing ML algorithms are shown, according to the evaluation of the four types of models (shown in Table S4). An additional analysis of the best feature selection approach has been made (Table S5).

Fig. 2. Workflow of the pipeline used in the study to analyze both clinical and radiomics data. On the left half of the workflow: clinical features were obtained from all patients with CT imaging available, the respective VOIs (TBV and GT) were segmented, and the subsequent radiomics features extracted. On the right half of the workflow: for each evaluated dataset, a 50-repeat nested cross-validation was performed. Within the inner fold, normalization, feature selection and an exhaustive grid search for optimal hyperparameters were performed.

Combined modelling
Figure 5 shows the best performing models using combined datasets of either radiomics feature set and clinical features. More scores regarding the comparison of the combinations of radiomics feature sets with clinical features, and their respective predictive performance, can be found in Table S6. When combining TBV radiomics features with clinical ones, LASSO performed best when predicting moist epitheliolysis (AUROC of 0.73), although without any overall improvement. RF performed best when predicting edema (AUROC of 0.53), though just above random.

Feature importance
Table 2 shows the feature importance scores for the best models. To account for both the importance score and the frequency by which a feature was chosen, we computed a score that was the product of these values and ranked the features accordingly.
From the list of the 15 most predictive features, more than half belong to the shape type, confirming that planar and volumetric information has a significant influence on the performance of oncological ML models 36-38.

Predictive influence of the total breast volume
The influence of the volume of the whole breast on the prediction quality of the models has been further analyzed. A logistic regression model trained only on breast volume achieved an AUROC of 0.75 ± 0.01, performing similarly to the best model trained on all TBV radiomics features. Over all 250 runs, there was a median Spearman correlation coefficient of 0.82 between breast volume and the prediction probabilities of the best radiomics model. Figure S3 shows the distribution of Spearman's correlation coefficients between breast volume and the prediction probabilities of the best performing model. The distribution of the respective p-values can be seen in Fig. S4. The correlation was significant (p < 0.05) in 243 runs.

Table 2. Top 15 feature importance report of the best performing model: a LASSO classifier trained on TBV radiomics features, selected by MRMR, and predicting moist epitheliolysis as a surrogate for skin inflammation. This report shows how often a feature has been chosen out of the 250 iterations (% chosen), the feature importance value given by the model (importance; LASSO coefficients), and a score encompassing the feature importance value and how often that feature was selected (product of these two values).
The predictive influence of the breast volume has been evaluated by retraining the best performing model while excluding all features with a Spearman correlation coefficient with the breast volume higher than 0.8. These results can be seen in Table S7. With an AUROC of 0.71, performance decreased slightly but significantly (from an AUROC of 0.74), confirming an effect of the breast volume on the performance of these radiomics models.

Discussion
In this study, we analyzed the relevance of CT-based radiomics to predict two common RT side effects, moist epitheliolysis as a surrogate for skin inflammation, and edema, using a statistically robust pipeline. The best prediction model was a LASSO classifier that was trained on radiomics features from the TBV and selected using MRMR, predicting moist epitheliolysis. This model achieved a moderate discriminatory power with an AUROC of 0.74. Clinical features alone or in combination with radiomics did not significantly improve predictive performance.
In contrast, edema was more difficult to predict, with a performance level just above random (AUROC of 0.55 for the best model). The predictions of the best radiomics model for moist epitheliolysis were largely correlated with the breast volume, which by itself showed a comparable predictive performance with an AUROC of 0.75.
These results confirm the previously known fact that radiomics features are largely correlated with the size of the VOI 39,40. Eliminating volume-correlated features slightly reduced the performance of the radiomics model (AUROC of 0.71 from 0.74). Consequently, radiomics features do carry relevant information for the prediction of radiotherapy side effects. However, these features are less predictive than breast volume.
The analysis of the importance of other features revealed several logical patterns. First, shape features appeared to be the most influential ones, indicating that geometrical features play a dominant role in predicting RT-dependent side effects. Maximum 2D Diameter Column being the most influential feature supports this idea, implying that larger or more irregular tumors may cause more adverse effects to RT due to how the dose distribution is made, and how it affects the neighboring tissue. Further, the presence of multiple gray level types of features suggests that the heterogeneity of the tumor tissue is another significant factor, possibly due to how different types of tissues may react to RT, and the side effects that appear as a consequence of this non-uniformity 41,42.
Naturally, the given radiation dose is a decisive factor for the development of RT-dependent side effects. The dose was part of the clinical prediction model, which achieved a decent predictive performance, albeit inferior to that of breast volume. In fact, the radiation doses given were largely similar, yielding low variability and thus low predictive value. Moreover, this cohort was solely treated with normofractionated RT (conventional RT dose fractionation schedule). The START B trial, however, also demonstrated the predictive value of breast size for physician-assessed normal tissue effects in the breast 43.
While LASSO yielded the overall best results, all other ML algorithms performed on a similar level. Only SVM performed slightly but statistically significantly worse, with an AUROC of 0.69 on the best configuration (compared to LASSO: AUROC of 0.74). The choice of algorithm is therefore relevant but does not substantially affect the performance of the model, as long as the model is optimized and properly trained. The choice of the feature selection technique had a small impact on the overall performance, managing to reduce the data dimensionality without losing much information.
This study is subject to two main limitations. The first one stems from the retrospective nature of our side effect data, deriving from past patient records, which presents a challenge to data quality. To this end, we decided to predict moist epitheliolysis, as it constitutes a binarized endpoint describing more aggravated skin inflammation. On the other hand, the detection and extent of edema was completely dependent on the subjective physician assessment. The second limitation regards the absence of an external validation cohort for an unbiased estimation of the performance of our models. To compensate for this and obtain a more reliable and unbiased assessment of model performance, we decided to apply a more robust resampling technique, in this case a 50-repeat nested cross-validation.

Conclusions
To conclude, the radiomics models developed in this study have shown a reasonable predictive power for moist epitheliolysis, while clinical features yielded intermediate albeit competitive results. Using information from the whole breast tissue, instead of just glandular tissue, achieved better results overall. The radiomics prediction probabilities were largely correlated with breast volume, which remained the most predictive feature, though this correlation affected the predictive power of radiomics features in general only to a small extent. These findings, however, should be further validated on larger, more diverse and multi-centered datasets. Future studies should investigate the potential variations in RT side effects prediction using radiomics information depending on the subtype and stage of breast cancer.

Fig. 1. Patient and data flowchart. In the left and central branches, the clinical and radiomics features can be found, respectively. The right branch shows the three RT side effects used as prediction targets.

Fig. 3. Test ROC curves of the best performing models for each of the side effects predicted. Moist epitheliolysis: LASSO classifier trained on TBV radiomics features, selected by MRMR. Edema: LASSO classifier trained on GT radiomics features, selected by Spearman rank correlation.

Fig. 4. Test ROC curves of the best performing models depending on the training data used for moist epitheliolysis. TBV radiomics features: LASSO classifier, with features selected by MRMR, predicting moist epitheliolysis. GT radiomics features: RF classifier, with features selected by MRMR, predicting moist epitheliolysis. Clinical baseline: LR classifier, with features selected by Spearman rank correlation, predicting moist epitheliolysis.

Fig. 5. Test ROC curves of the best performing model trained on a single feature set and the best performing model trained on a combined feature set. Best single feature set model: LASSO classifier trained on TBV radiomics features, selected by MRMR, predicting moist epitheliolysis. Best combined feature set model: LASSO classifier trained on a combination of TBV and clinical features, selected by MRMR, predicting moist epitheliolysis.

Table 1. Summary of the best AUROC performances for a given feature set used for training and side effect predicted.